Why do people keep parsing HTML using regex? [closed]

Posted by polygenelubricants on Stack Overflow See other posts from Stack Overflow or by polygenelubricants
Published on 2010-05-03T09:47:44Z Indexed on 2010/05/04 3:38 UTC
Read the original article Hit count: 284

Filed under:

regex

|

best-practices

|

subjective

|

html-parsing

|

industry-standard

As much as I love regular expressions, it's obvious to me that it's not the best tool for parsing HTML, especially given the numerous good HTML parsers out there.

And yet there are numerous questions on stackoverflow that attempts to parse HTML using regex. And people would always point out what a bad idea that is in the comments. And the accepted answer would often have a disclaimer how this isn't really the ideal way of doing things.

But based on the constant flow of questions, it still seems that people keep parsing HTML using regex, despite the perceived difficulty in reading and maintaining it (and that's putting correctness aside for now).

So my question is: why?

Is it because it's easy to learn?
Is it because it's faster?
Is it because it's the industry standard?
Is it because there are already so many reusable regexes to build from?
Is it because 100% correctness is never really the objective? (90% good enough?)
etc...

I'd also like to hear from the downvoters why they did so. Is it because:

There's absolutely nothing wrong with using regex to parse HTML and asking "Why?" is just dumb?
The premise of the question is flawed because the people who are using regex to parse HTML is such a small minority?

© Stack Overflow or respective owner

Related posts about regex

Find multiple regex in each line and skip result if one of the regex doesn't match

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a list of variables: variables = ['VariableA', 'VariableB','VariableC'] which I'm going to search for, line by line ifile = open("temp.txt",'r') d = {} match = zeros(len(variables)) for line in ifile: emptyCells=0 for i in range(len(variables)): regex = r'('+variables[i]+r')[:|=|\(](-… >>> More
OWASP Regex Repository: Is this regex correct?

as seen on Stack Overflow - Search for 'Stack Overflow'
I was looking at the regular expression for validating various data types from the (OWASP Regex Repository). One of the regular expressions in there is called safetext and looks like: ^[a-zA-Z0-9\s.\-]+$ My first question is: Is this regular expression correct? complementary question If this… >>> More
Make a Perl-style regex interpreter behave like a basic or extended regex interpreter

as seen on Stack Overflow - Search for 'Stack Overflow'
I am writing a tool to help students learn regular expressions. I will probably be writing it in Java. The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough. But I want to support several different regex… >>> More
JS regex isn't matching, even thought it works with a regex tester

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm writing a piece of client-side javascript code that takes a function and finds the derivative of it, however, the regex that's supposed to match with the power rule fails to work in the context of the javascript program, even though it sucessfully matches when it's used with an independent regex… >>> More
c# RegEx with "|"

as seen on Stack Overflow - Search for 'Stack Overflow'
I need to be able to check for a pattern with | in them. For example an expression like d*|*t should return true for a string like "dtest|test". I'm no regex hero so I just tried a couple of things, like: Regex Pattern = new Regex("s*\|*d"); //unable to build because of single backslash Regex Pattern… >>> More

Related posts about best-practices

Batch Best Practices and Technical Best Practices Updated

as seen on Oracle Blogs - Search for 'Oracle Blogs'
The Batch Best Practices for Oracle Utilities Application Framework based products (Doc Id: 836362.1) and Technical Best Practices for Oracle Utilities Application Framework Based Products (Doc Id: 560367.1) have been updated with updated and new advice for the various versions of the Oracle… >>> More
Partial template specialization of free functions - best practices

as seen on Stack Overflow - Search for 'Stack Overflow'
As most C++ programmers should know, partial template specialization of free functions is disallowed. For example, the following is illegal C++: template <class T, int N> T mul(const T& x) { return x * N; } template <class T> T mul<T, 0>(const T& x) { return T(0); } //… >>> More
Object Oriented PHP Best Practices

as seen on Stack Overflow - Search for 'Stack Overflow'
Say I have a class which represents a person, a variable within that class would be $name. Previously, In my scripts I would create an instance of the object then set the name by just using: $object->name = "x"; However, I was told this was not best practice? That I should have a function set_name()… >>> More
Notification Email Best Practices--From Server Setup to Programming

as seen on Stack Overflow - Search for 'Stack Overflow'
All, I'm in the process now of building a SaaS tool that allows network admins to generate notification emails to the members of the end-users of our platform (among many many other things). I'm running into a bit of an "out of my expertise" wall, as I know there are a lot of variables involved… >>> More
Good place to look for example Database Designs - Best practices

as seen on Stack Overflow - Search for 'Stack Overflow'
I have been given the task to design a database to store a lot of information for our company. Because the task is rather big and contains multiple modules where users should be able to do stuff, I'm worried about designing a good data model for this. I just don't want to end up with a badly designed… >>> More